System and method for textual analysis of images

ABSTRACT

Segmentation first breaks the images into segments or regions, with the segments or regions having text or symbols. The segmented image is separately applied to two different CNN-based models. Each model produces text boxes where potential text might exist. Then, a selective NMS algorithm is applied to the output of each model to produce a final group of text regions. These text regions are analyzed and actions are taken.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of Indian Provisional Application No. 201941030477, filed Jul. 29, 2019, and U.S. Provisional Application No. 62/902,745, filed Sep. 19, 2019, both of which are incorporated herein by reference in their entireties.

TECHNICAL FIELD

These teachings relate to analyzing text in images and performing actions as a result of the analysis.

BACKGROUND

Detecting regions of text and extracting that information from natural images is a challenging problem due to the presence of multiple types of text in various shapes and sizes, together with many other visual objects. One area of text extraction is extracting text from product images, particularly in e-commerce, especially when the intent is to extract brand information, product type, and/or various other attributes from a product label.

Large e-commerce companies sell billions of products through their websites. All these products are associated with one or more product images containing various textual information about them. Extracting this information not only enhances the quality of the product catalogue but also facilitates comparison of accurate product information with respect to various compliance policies of the organization. One of the primary requisites to extract information from product images is to extract text from those images with high accuracy and coverage.

BRIEF DESCRIPTION OF THE DRAWINGS

The above needs are at least partially met through the provision of approaches for analysing images, wherein:

FIG. 1 comprises a diagram of a system as configured in accordance with various embodiments of these teachings;

FIG. 2 comprises a flowchart as configured in accordance with various embodiments of these teachings;

FIG. 3 comprises a diagram of a system as configured in accordance with various embodiments of these teachings;

FIG. 4 comprises a diagram of a system as configured in accordance with various embodiments of these teachings;

FIG. 5 comprises a flowchart as configured in accordance with various embodiments of these teachings.

DETAILED DESCRIPTION

Generally speaking and in the approaches presented herein, images of products are obtained from vendors or other entities. Segmentation first breaks the images into segments or regions, with the segments or regions having text or symbols. The segmented image is then separately applied to two different mathematical models (e.g., CNN-based models). Each model produces text boxes where potential text might exist. Then, a selective NMS algorithm is applied to the output of each model to produce a final group of text regions. These text regions are analyzed and actions taken. The actions could be to apply the informational content of the text to modify a website (e.g., a product catalog at the website), or to detect offensive language in the text. If offensive language is determined to exist, the vendor may be alerted and, if the item already exists in a store or warehouse, the item can be removed.

The approaches presented herein provide an end-to-end text detection strategy combining a segmentation algorithm and an ensemble of multiple text detectors of different types to detect text in every individual image segment independently. In aspects, these approaches involve a super-pixel-based image segmenter which splits an image into multiple regions. In some examples, a convolutional deep neural architecture is developed that works on each of the segments and detects texts of multiple shapes, sizes and structures. It outperforms previous methods in terms of coverage in detecting texts in images, especially those where text of various types and sizes is compacted in a small region along with various other objects. Furthermore, the text detection and text recognition approaches provided herein outperform previous approaches in extracting text from high entropy images. Entropy can be defined as the average information in an image and can be determined approximately from a histogram of multichannel and colour-space features. In one example, an image can be regarded as high entropy when its average information exceeds a threshold.

In other aspects, an ensemble modelling approach for feature extraction is provided by combining multiple Convolutional Neural Network (CNN) based models using selective non-maximal suppression. Ensemble model algorithms for detecting text of varying scales are utilized in each of the segments. In other aspects, a segmentation algorithm, which is tailor-fitted to segment out the regions of the image containing text, is provided. Additionally, an ensemble of multiple neural networks, which extract features from segments of the image, is utilized in order to detect text of varied sizes in the form of compact bounding boxes. The proposed architecture is highly parallelizable, and the results show comparable, and in some cases better, accuracies in comparison to the competitors.

In many of these embodiments, a system includes a data storage unit, an electronic communication network, an electronic server, and a control circuit. The data storage unit includes a trained first mathematical model and a trained second mathematical model. The first mathematical model is different and distinct from the second mathematical model.

The electronic server is coupled to the electronic communication network and hosts a web-based catalog ordering system that receives electronic orders from customers.

The control circuit is coupled to the electronic communication network and the data storage unit. The control circuit is configured to receive an image of a product from a vendor via the electronic communication network. In aspects, the product is proposed by the vendor to be sold to retail customers.

The control circuit is configured to perform segmentation on the image to divide the image into individual regions of homogeneous pixels. The segmentation is effective to create a segmented image.

The control circuit is further configured to apply the segmented image to the first mathematical model to produce a first group of text regions and apply the segmented image to the second mathematical model to obtain a second group of text regions. Each of the text regions is a region that includes potential text or symbols.

The control circuit is still further configured to apply a selective non-maximal suppression (sNMS) algorithm to the first group of text regions and the second group of text regions to obtain a final group of text regions. The selective NMS algorithm is effective to remove overlapping regions at the same location or general location in the image. The selective NMS algorithm selects text regions most likely to include text.

The control circuit is yet further configured to analyze informational content of the text regions and perform an action that utilizes the informational content of the text regions. The action can be applying the informational content to the web-based ordering catalog, receiving a customer order from a customer as a result of the informational content, and physically fulfilling the received customer orders using an automated order fulfilment system to ship items in the order to the customer. In another example, the action is scanning the informational content for offensive content, and sending a message to a vendor via the electronic network to remove the offensive content or removing the item from a retail store or warehouse when an item including the offensive content exists in the retail store or warehouse.

In aspects, the item that is removed from the retail store or warehouse is removed using an automated vehicle to navigate to the item and remove the item from a display unit or storage unit. In examples, the automated vehicle is an automated ground vehicle or an aerial drone. Other examples are possible.

In other aspects, the first group of text regions, the second group of text regions, and the final group of text regions comprise text boxes. In still other examples, the first mathematical model and the second mathematical model are convolutional neural networks (CNNs). Other examples are possible.

In examples, the first mathematical model and the second mathematical model are trained using training images.

In still other aspects, the system further comprises a camera. The camera is coupled to the electronic communication network and is configured to obtain the image.

In others of these embodiments, a data storage unit that includes a trained first mathematical model and a trained second mathematical model is provided. The first mathematical model is different and distinct from the second mathematical model.

An electronic communication network and an electronic server that is coupled to the electronic communication network are provided. The server hosts a web-based catalog ordering system that receives electronic orders from customers.

A control circuit that is coupled to the electronic communication network and the data storage unit is also provided. At the control circuit, an image of a product is received from a vendor via the electronic communication network. The product is proposed by the vendor to be sold to retail customers.

At the control circuit, segmentation is performed on the image to divide the image into individual regions of homogeneous pixels. The segmentation is effective to create a segmented image.

At the control circuit, the segmented image is applied to the first mathematical model to produce a first group of text regions. The segmented image is applied to the second mathematical model to obtain a second group of text regions. Each of the text regions is a region that includes potential text or symbols.

At the control circuit, a selective non-maximal suppression (sNMS) algorithm is applied to the first group of text regions and the second group of text regions to obtain a final group of text regions. The selective NMS algorithm is effective to remove overlapping regions at the same location or general location in the image. The selective NMS algorithm selects text regions most likely to include text.

At the control circuit, informational content of the text regions is analyzed and an action determined and performed that utilizes the informational content of the text regions.

The action can be applying the informational content to the web-based ordering catalog, receiving a customer order from a customer as a result of the informational content, and physically fulfilling the received customer orders using an automated order fulfilment system to ship items in the order to the customer.

In another example, the action can be scanning the informational content for offensive content, and sending a message to a vendor via the electronic network to remove the offensive content or removing the item from a retail store or warehouse when an item including the offensive content exists in the retail store or warehouse. Other examples of actions are possible.

Referring now to FIG. 1, a system 100 for analyzing images is described. The system 100 includes a data storage unit 102, an electronic communication network 104, an electronic server 106, and a control circuit 108.

The data storage unit 102 is any type of electronic memory storage device. The data storage unit 102 includes a trained first mathematical model 110 and a trained second mathematical model 112. The trained first mathematical model 110 is different and distinct from the trained second mathematical model 112. In examples, the trained first mathematical model 110 is more accurate in the results it provides than the trained second mathematical model 112. In other examples, the first mathematical model 110 has a different structure (and thereby provides in some instances non-identical results) as compared to the second mathematical model 112.

The electronic communication network 104 is any type of electronic communication network such as the internet, a wireless network, a local area network, a wide area network, a cellular network, or combinations of these or other networks. Other examples of networks are possible.

The electronic server 106 is coupled to the electronic communication network 104 and hosts a web-based catalog ordering system that receives electronic orders from customers. The electronic server 106 may include control circuits, transceivers, other types of network communication devices, and/or electronic memory that allow it to host, control, interact with, or present an internet based catalog. The catalog has various types of information concerning products. The electronic server 106 may present the catalog to customers via the network 104 and receive customer orders via the network 104. As described herein, one result of the analysis of textual information is modification to (additions to, changes to, or deletions to) the electronic catalog. In structure, the catalog may be presented on web pages that are presented to and allow interaction with customers. It will be appreciated that the displayed catalog will potentially change as the textual information is processed.

The control circuit 108 is coupled to the electronic communication network 104 and the data storage unit 102. It will be appreciated that as used herein the term “control circuit” refers broadly to any microcontroller, computer, or processor-based device with processor, memory, and programmable input/output peripherals, which is generally designed to govern the operation of other components and devices. It is further understood to include common accompanying accessory devices, including memory, transceivers for communication with other components and devices, etc. These architectural options are well known and understood in the art and require no further description here. The control circuit 108 may be configured (for example, by using corresponding programming stored in a memory as will be well understood by those skilled in the art) to carry out one or more of the steps, actions, and/or functions described herein.

The control circuit 108 is configured to receive an image of a product from a vendor via the electronic communication network. The image may be in any format, e.g., a JPEG file. The product is proposed by the vendor to be sold to retail customers. In one example, the vendor of the product wishes to sell the product.

The control circuit 108 is also configured to perform segmentation on the image to divide the image into individual regions of homogeneous pixels. The segmentation is effective to create a segmented image as described elsewhere herein.

The control circuit 108 is further configured to apply the segmented image to the first mathematical model to produce a first group of text regions and apply the segmented image to the second mathematical model to obtain a second group of text regions. Each of the text regions is a region that includes potential text or symbols. In aspects, the first group of text regions, the second group of text regions, and the final group of text regions comprise text boxes. In still other examples, the first mathematical model and the second mathematical model are convolutional neural networks (CNNs). Other examples of models are possible.

The control circuit 108 is still further configured to apply a selective non-maximal suppression (sNMS) algorithm to the first group of text regions and the second group of text regions to obtain a final group of text regions. The selective NMS algorithm is effective to remove overlapping regions at the same location or general location in the image. The selective NMS algorithm selects text regions most likely to include text.

The control circuit 108 is yet further configured to analyze informational content of the text regions and perform an action that utilizes the informational content of the text regions. The action can be applying the informational content to the web-based ordering catalog, receiving a customer order from a customer as a result of the informational content, and physically fulfilling the received customer orders using an automated order fulfilment system to ship items in the order to the customer.

In another example, the action is scanning the informational content for offensive content, and sending a message to a vendor via the electronic network to remove the offensive content or removing the item from a retail store or warehouse when an item including the offensive content exists in the retail store or warehouse.

In aspects, the item that is removed from the retail store or warehouse 120 is removed using an automated vehicle 122 to navigate to the item and remove the item from a display unit or storage unit. In examples, the automated vehicle 122 is an automated ground vehicle or an aerial drone. In aspects, the automated vehicle 122 may include levers, grips, arms, suction grips, and other mechanical features that allow the vehicle 122 to retrieve, move, grip, and/or transport products. It will also be understood that the movement and/or actions of the automated vehicle 122 involve interaction with the retail store or warehouse 120 and/or items and fixtures (e.g., shelves) in the retail store or warehouse 120. Navigation may be made so as to avoid collisions with humans or objects.

In examples, the first mathematical model 110 and the second mathematical model 112 are trained using training images. In aspects, a training set is created for each of multiple models by sampling an original training dataset using a “sampling with replacement” approach. In this approach, a sampling unit is drawn from a finite population and, after its characteristic(s) have been recorded, is returned to that population before the next unit is drawn.
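The sampling-with-replacement approach described above can be sketched as follows. This is a minimal illustration only: the function names `bootstrap_sample` and `make_training_sets`, the per-model seeding, and the choice of drawing as many items as the original dataset contains are assumptions made for this sketch and are not specified in the text.

```python
import random


def bootstrap_sample(dataset, size=None, seed=None):
    """Draw `size` items from `dataset` with replacement.

    Each drawn item is "returned" to the population before the next draw,
    so the same item may appear more than once in the sample.
    """
    rng = random.Random(seed)
    size = len(dataset) if size is None else size
    return [rng.choice(dataset) for _ in range(size)]


def make_training_sets(dataset, n_models, seed=0):
    """Create one resampled training set per model (hypothetical helper)."""
    return [bootstrap_sample(dataset, seed=seed + m) for m in range(n_models)]


# Placeholder image identifiers standing in for real training images.
images = ["img_a", "img_b", "img_c", "img_d"]
training_sets = make_training_sets(images, n_models=2)
```

Because each model sees a differently resampled set, the trained models can differ even when they share an architecture, which is one way an ensemble may obtain non-identical outputs.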

In still other aspects, the system further comprises a camera 124. The camera 124 is coupled to the electronic communication network 104 and is configured to obtain the images. It will be appreciated that the images may be of any type or format (e.g., a JPEG format).

Referring now to FIG. 2, one example of an approach for obtaining textual content in images is described. At step 202, a data storage unit that includes a trained first mathematical model and a trained second mathematical model is provided. The first mathematical model is different and distinct from the second mathematical model. For example, the models may be structured differently (e.g., different layers, number of layers, or weights) so that when the same input is applied to each, each will not necessarily produce the same output.

At step 204, an electronic communication network and an electronic server that is coupled to the electronic communication network are provided. The server hosts a web-based catalog ordering system that receives electronic orders from customers. The server manages the catalog ordering system. For example, the server manages the content of an electronic catalog, presents the catalog to customers, receives and manages orders, and sends electronic messages to order products that are shipped to customers.

At step 206, a control circuit that is coupled to the electronic communication network and the data storage unit is also provided.

At step 208 and at the control circuit, an image of a product is received from a vendor via the electronic communication network. The vendor proposes that the product is to be sold to retail customers. In one example, the vendor is a supplier, and a retail chain receives product proposals from vendors. The image may include potentially multiple views of the product including labels on the product, other markings on the product, the shape of the product, the colour of the product, and/or the packaging of the product. It will be appreciated that multiple images may be provided to show various features (e.g., one image may show the product itself, while another image may show the packaging, e.g., a box, of the product).

At step 210 and at the control circuit, segmentation is performed on the image to divide the image into individual regions of homogeneous pixels. The segmentation is effective to create a segmented image. The segmentation process is described in greater detail elsewhere herein.

At step 212 and at the control circuit, the segmented image is applied to the first mathematical model to produce a first group of text regions. The segmented image is applied to the second mathematical model to obtain a second group of text regions. Each of the text regions is a region that includes potential text or symbols. In other words, application of the same input (the segmented image) to each model produces first and second groups of text regions that may or may not be the same. By using models that provide potentially different results, greater accuracy in determining textual areas is obtained.

At step 214 and at the control circuit, a selective non-maximal suppression (sNMS) algorithm is applied to the first group of text regions and the second group of text regions to obtain a final group of text regions. The selective NMS algorithm is effective to remove overlapping regions at the same location or general location in the image. The selective NMS algorithm selects text regions most likely to include text.

At step 216 and at the control circuit, informational content of the text regions is analyzed and an action determined and performed that utilizes the informational content of the text regions. Various types of actions are possible.

For example, the action can be applying the informational content to the web-based ordering catalog, receiving a customer order from a customer as a result of the informational content, and physically fulfilling the received customer orders using an automated order fulfilment system to ship items in the order to the customer.

In another example, the action can be scanning the informational content for offensive content, and sending a message to a vendor via the electronic network to remove the offensive content or removing the item from a retail store or warehouse when an item including the offensive content exists in the retail store or warehouse. Other examples of actions are possible.

Referring now to FIG. 3, one example of an image processing process is described. An image 302 is applied to a segmentation process to obtain a segmented image 304. The segmented image is applied to a first mathematical model 306 and a second mathematical model 308. Selective non-maximal suppression 310 is applied to the outputs of the models 306 and 308 to obtain a final group of text boxes 312.

Referring now to FIG. 4, one example of applying these approaches is described. Three different images are shown in columns 402, 404, and 406. For each of these images, an actual image is shown in row 412, a dilated image is shown in row 414, and a segmented image is shown in row 416.

Referring now to FIG. 5, the segmentation process is described. The proposed segmentation method consists of a number of intermediate steps resulting in spatially connected segments of homogeneous pixels. One primary goal of the segmentation module is to ensure that the complete image is segmented into a number of regions such that the different text objects are enclosed in the individual regions. To ensure spatial continuity in the segments, dilation of the image is first performed so that the holes and gaps inside objects are nullified and small intrusions at object boundaries are somewhat smoothed. Super-pixels of fixed size in the dilated image are considered in order to calculate various features summarising super-pixel level information. Based on the super-pixel level feature information on the dilated image, a Gaussian Mixture model is fitted to identify the class of super-pixels in an unsupervised manner. The details of these steps are described below.

At step 502, dilation is performed. In aspects, the image dilation is performed by convolving the image with a suitable kernel, in one example, a Gaussian kernel. The anchor point of the kernel is chosen to be the center of the kernel. As the chosen kernel is scanned over the image, the pixel value at the anchor point is replaced by the maximum pixel value of the image region overlapping the kernel. This causes the regions of interest in the image to grow and the holes and gaps within the object to be nullified. As a result, the segments containing the objects are over-compensated, making it less likely that a segment truncates an object inside its true boundaries or splits at holes and gaps within the object.
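The maximum-replacement step described above can be sketched as follows. This is a simplified illustration that assumes a flat square kernel (every position under the kernel counts equally) applied to a single-channel image stored as a list of rows; the function name `dilate` and the `ksize` parameter are hypothetical and not taken from the text.

```python
def dilate(image, ksize=3):
    """Grey-scale dilation with a flat ksize x ksize kernel.

    The kernel's anchor is its center; each output pixel is the maximum
    value found in the part of the image overlapped by the kernel.
    """
    h, w = len(image), len(image[0])
    r = ksize // 2  # distance from the anchor to the kernel edge
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = max(
                image[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return out


# A bright ring with a one-pixel hole in the middle.
ring = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 9, 0],
    [0, 9, 0, 9, 0],
    [0, 9, 9, 9, 0],
    [0, 0, 0, 0, 0],
]
filled = dilate(ring)
```

In this toy input the hole at the center of the ring is filled and the bright region grows outward by one pixel, which mirrors the growth of interesting regions and the removal of holes described above.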

At step 504, super-pixel features are considered. For each super-pixel of fixed size s∈S, a set of features $x_s$ is calculated. For each super-pixel s and each of the colour channels c∈C, the mean, standard deviation and energy are calculated, denoted by

$x_{s,c}^{(1)} = [\mu_{s,c}, \sigma_{s,c}, e_{s,c}].$
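The per-channel statistics above can be sketched as follows. The text does not define "energy"; this sketch assumes it is the mean of the squared pixel values, and it uses the population standard deviation. The function names `channel_stats` and `superpixel_features` are hypothetical.

```python
import math


def channel_stats(values):
    """Return [mean, standard deviation, energy] for one colour channel.

    Assumption: energy is taken to be the mean of squared values.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    energy = sum(v * v for v in values) / n
    return [mean, math.sqrt(variance), energy]


def superpixel_features(pixels):
    """Concatenate [mean, std, energy] over all colour channels.

    `pixels` is a list of per-pixel channel tuples belonging to one super-pixel.
    """
    n_channels = len(pixels[0])
    features = []
    for c in range(n_channels):
        features.extend(channel_stats([p[c] for p in pixels]))
    return features


# A two-pixel "super-pixel" with three colour channels.
patch = [(1.0, 0.0, 2.0), (3.0, 0.0, 2.0)]
x_s1 = superpixel_features(patch)
```

For the first channel of this toy patch the values are 1 and 3, giving a mean of 2, a standard deviation of 1, and an energy of 5 under the stated assumption.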

To summarize the texture features, in some aspects a Leung-Malik filter bank at multiple scales and orientations is considered. In total, and in some examples, first and second derivatives at 6 orientations, 8 Laplacian of Gaussian filters and 4 Gaussians are considered, and the convolution with the pixels is taken in the different channels. To ensure orientation invariance, the maximum response over all orientations at each pixel is taken. The mean, standard deviation and energy are calculated over all pixels within a super-pixel for each filter convolution to obtain the features:

$x_{s,c}^{(2)} = [\mu_{s,c,j}, \sigma_{s,c,j}, e_{s,c,j}]_{j \in \mathfrak{I}}$

for all colour channels c∈C and super-pixels s∈S. The combined feature set for a given super-pixel is given by:

$x_s = [x_{s,c}^{(1)}, x_{s,c}^{(2)}]_{c \in C}.$

At step 506, super-pixel similarity is considered.

Following approaches as known in the art, the similarity of neighbourhood super-pixels is incorporated based on a function:

$w(s, s')$, for all $s, s' \in S$.

Information available over the entire set of features and spatial distance is combined to calculate the similarity function $w(\cdot, \cdot)$ between neighbourhood super-pixels. The Euclidean distance between the features of two super-pixels s, s′ is denoted by $d(x_s, x_{s'})$ and the standard deviation across all super-pixels by $\sigma_x$. The spatial Euclidean distance between a pair of super-pixels is given by:

$d(s, s')$

and the average distance across all super-pixels by:

$\bar{d}(S).$

Combining the feature level information and spatial distance between super-pixels, the similarity function is given by:

$w(s, s') = \exp\left(-\frac{d(x_s, x_{s'})}{2\sigma_x^2}\right)\left(\frac{d(s, s')}{\bar{d}(S)}\right)^{-1}.$

At step 508, segment classification occurs. Based on the computed features and weight function, the super-pixels are classified into a number of classes in an unsupervised manner. Let the unknown classes of the super-pixels be denoted by $Y = \{y_s, s \in S\}$. If there are K segments present in a given image, denoting K classes, we have $y_s \in \{1, 2, \dots, K\}$ for s∈S. The class information given by the joint class probability function is factorized as:

$p(Y) = \prod_{s \in S} \pi(y_s) \prod_{s, s' \in S} R(y_s, y_{s'})$

where the class prior probabilities are given by $\pi(y_s)$. The mutual information between a pair of neighbourhood super-pixels is given by:

$R(y_s, y_{s'}) = \beta\, w(s, s')\, B(y_s, y_{s'}),$

β > 0 being a tuning parameter controlling the spatial regularization. Here $B(y_s, y_{s'})$ is a spatial regularisation function indicating the chance of two neighbouring super-pixels belonging to the same class. A diagonal structure of the matrix $[B(y_s, y_{s'}),\ s, s' \in S]$ is chosen, with all the diagonal elements equal to 1. Given a fixed class k, the features are assumed to have a Gaussian distribution with fixed mean $\mu_k$ and variance-covariance matrix $\Sigma_k$, given by:

$p(x_s \mid y_s = k) = N_k(\mu_k, \Sigma_k).$

Hence, the super-pixel class is predicted by estimating the model parameters using the Expectation-Maximization algorithm and hence evaluating:

$(\hat{y}_s, s \in S) = \arg\max_{\hat{y}_s,\, s \in S} \prod_{s \in S} p(x_s \mid y_s)\, \pi(y_s) \prod_{s, s' \in S} R(y_s, y_{s'}).$

The estimated class information of the super-pixels is used to merge super-pixels of the same class level to get different segments. For three selected examples of images, the results of the proposed strategy after the dilation and segmentation are shown, for example, in FIG. 4.

In some examples, the text detection strategy of the approaches described herein uses an ensemble of CNN models to probe each of the detected segments and extract texts of various sizes. The task of text detection in an image is very similar to object detection, where the text can be treated as an object. Hence, all object detection models can be used by making them binary classifiers with two classes: text (word level) and non-text. But all these object classifiers have their own limitations.

Sometimes the image has a large amount of text compacted in a region, forming a text cluster. Detecting these words separately becomes hard for conventional object detection techniques as they are trained to recognize a small number of separable objects in an image.

Text in a single image can vary in both font size and font style. Although it is sometimes claimed that most object detection methods are scale invariant, the results say otherwise, as known to those skilled in the art. Text in most cases, unlike objects, has a rectangular aspect ratio. Wide kernels will capture information about the aspect ratio of text objects better than square kernels. An ensemble of multiple CNN based models ensures that a different level of information will be captured by different kinds of models, resulting in better coverage of the information gathered from the image.

The models are then stitched together using a selective non-maximal suppression algorithm. Non-maximal suppression removes multiple overlapping boxes detected for the same text location and keeps the one with the highest probability. Selective non-maximal suppression does the same but also takes into account the accuracy of the model from which the bounding box has been generated, giving it higher preference. Predictions from models which have a higher accuracy are preferred over others even if the individual probability might be slightly smaller.

Non-maximal suppression approaches are now described. Let us assume that there are n models and that the number of bounding boxes predicted by the j-th model is $n_j$. Let K be the list of all bounding boxes such that $k_{ij}$ is the i-th bounding box predicted by model j, with $p_{k_{ij}}$ being the probability of that bounding box containing text.

Let $\mathcal{K}$ represent a sorted ordering of all these bounding boxes. That implies

$|\mathcal{K}| = |K| = n_1 + n_2 + n_3 + \dots + n_n.$

One example of an NMS algorithm is:

Algorithm 1 NMS Algorithm
procedure NMSALGORITHM(k, p, nmsThreshold)
    𝒦 = sort(k, p, desc)    ▷ sort k based on prob. p in desc order
    for i = 1, ..., |𝒦| do
        for j = i + 1, ..., |𝒦| do
            if IOU(𝒦_i, 𝒦_j) > nmsThreshold then
                𝒦.pop(j)
            end if
        end for
    end for
    return 𝒦
end procedure
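Algorithm 1 can be sketched in runnable form as follows. This is a minimal illustration under stated assumptions: boxes are axis-aligned tuples `(x1, y1, x2, y2)`, the function returns indices into the input list rather than mutating a sorted list, and the names `iou` and `nms` are chosen here for illustration. Building a list of kept boxes is equivalent to popping suppressed boxes from the sorted list.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, probs, nms_threshold):
    """Greedy non-maximal suppression; returns indices of the kept boxes."""
    # Visit boxes from highest to lowest probability.
    order = sorted(range(len(boxes)), key=lambda i: probs[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 100, 100), (0, 0, 100, 98), (200, 200, 300, 300)]
probs = [0.9, 0.8, 0.7]
kept = nms(boxes, probs, nms_threshold=0.95)
```

In this toy input the first two boxes have an IOU of 0.98, so the lower-probability one is suppressed, while the third box does not overlap anything and is kept.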

Selective non-maximal suppression (sNMS) is now described. Let $M_q$ denote the model with the highest accuracy $a_q$ among all models $M_i$, where i∈{1, 2, . . . , n}. Let $P_t$ be the threshold probability, that is, the probability above which a bounding box predicted by a model is considered a true text box. $P_t$ is kept high for $M_q$, say $P_{t_h}$, while $P_t$ is kept slightly lower for the other n − 1 models, say $P_{t_l}$. The bounding boxes predicted by each of the models are first filtered using these thresholds. After that, the probability of each of the $n_q$ predicted boxes of model $M_q$ is set to 1, while the probabilities of the other boxes are left untouched. After this reassignment of probabilities, NMS is performed on all the predicted boxes from all n models.

One example of a sNMS algorithm is:

Algorithm 2 Selective NMS Algorithm
procedure SELECTIVENMSALGORITHM(a, P_(t_l), P_(t_h), k, p, nmsThreshold)
    q = argmax(a)
    for k_(q) do    // bounding boxes in q^(th) model
        remove boxes where p_(q) < P_(t_h)
    end for
    for k_(q) do    // remaining bounding boxes in q^(th) model
        p_(q) = 1
    end for
    for r = 1,...,q − 1, q + 1,...,n do
        for k_(r) do
            remove boxes where p_(r) < P_(t_l)
        end for
    end for
    bbs = NMSALGORITHM(k, p, nmsThreshold)
    return bbs
end procedure
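A minimal Python sketch of the selective NMS procedure follows, under stated assumptions: boxes use an `(x_min, y_min, x_max, y_max)` layout, each model's output is a list of `(box, probability)` pairs, and the names `selective_nms`, `nms` and `iou` are illustrative only. The plain NMS step is repeated here so that the example runs on its own.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # assumed (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]             # (box, probability of containing text)


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections: List[Detection], nms_threshold: float) -> List[Detection]:
    """Plain NMS: keep a box unless it overlaps a higher-ranked kept box."""
    kept: List[Detection] = []
    for box, prob in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= nms_threshold for kept_box, _ in kept):
            kept.append((box, prob))
    return kept


def selective_nms(accuracy: List[float], p_t_low: float, p_t_high: float,
                  per_model: List[List[Detection]],
                  nms_threshold: float) -> List[Detection]:
    """Filter each model's boxes by its threshold, promote the surviving boxes
    of the most accurate model to probability 1, then run NMS on the pool."""
    q = max(range(len(accuracy)), key=lambda i: accuracy[i])  # most accurate model
    pooled: List[Detection] = []
    for model_idx, detections in enumerate(per_model):
        if model_idx == q:
            pooled += [(box, 1.0) for box, p in detections if p >= p_t_high]
        else:
            pooled += [(box, p) for box, p in detections if p >= p_t_low]
    return nms(pooled, nms_threshold)
```

In this sketch, a box from the most accurate model with probability 0.92 displaces a near-identical box from a less accurate model with probability 0.99, because the former is promoted to probability 1 before NMS runs.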

It will be appreciated that selective NMS ensures that the text boxes predicted with high probability by the model with the highest accuracy will always have priority over similar text boxes predicted with high probability by other models.

In one practical application, multiple models are deployed and used to detect text from an image, and their outputs are stitched together using the selective-NMS algorithm. Multiple pre-trained models (e.g., as developed in Liao et al.) were used to detect text boxes from images. For selective-NMS, the threshold P_(t) for the model with the highest accuracy was set to 0.9, the probability above which a bounding box was considered a true text box with high confidence. The same parameter was set to 0.8 for the other models. The NMS threshold, i.e., the ratio of the intersection area of two text boxes to their union area (IOU), was set to 95%, so that two text boxes with an IOU above 95% are considered to contain the same text. For text recognition, 9 million text images with various sizes, styles and backgrounds of text were synthesized using a SynthText tool for training. The full training set was run on a computer with a standard K80 GPU, and the average execution time for detecting text in a single image was recorded to be around 0.15 s.

The ICDAR2013 dataset consists of images where the user is explicitly directing the focus of the camera on the text content of interest in a real scene. The product image dataset used (a Walmart dataset), on the other hand, consisted of images of items taken with a high-resolution camera and having no background (white). By converting each image to grey-scale, the entropy of the images was calculated in areas where the text is present. The average entropy of a sample of images from the ICDAR2013 dataset was around 7.0 while that of images from the Walmart dataset was around 6.0, with 6.5 marking a demarcation boundary for separating the two datasets.
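The text does not state which entropy measure was used. The sketch below assumes the common choice of Shannon entropy (in bits) over the grey-level histogram of a region, which for 8-bit images ranges from 0 to 8 and is therefore compatible with the reported values of about 6.0 and 7.0. The grey-scale conversion uses the ITU-R BT.601 luma weights, also an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def to_grey(rgb_pixels: Iterable[Tuple[int, int, int]]) -> List[int]:
    """Convert 8-bit RGB pixels to 8-bit grey levels.
    Uses BT.601 luma weights (an assumption, not taken from the text)."""
    return [min(255, round(0.299 * r + 0.587 * g + 0.114 * b))
            for r, g, b in rgb_pixels]


def shannon_entropy(grey_pixels: Sequence[int]) -> float:
    """Shannon entropy, in bits, of the grey-level histogram of a region."""
    total = len(grey_pixels)
    if total == 0:
        return 0.0
    counts = Counter(grey_pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under these assumptions, a region whose entropy falls above 6.5 would sit on the ICDAR2013 side of the demarcation boundary quoted above, and a region below 6.5 on the product-image side.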

Some experimental results are now described. ICDAR2013 contains high-resolution real-world images. The models were trained on the ICDAR2013 training set and then tested on the ICDAR2013 validation set. The results from all the models were then passed through selective NMS, and the final bounding boxes were used for computing precision, recall and F-score. Table 1 summarizes and compares the results of the approaches provided herein (denoted by “Ensemble”) with other methods from other sources or products (MMser, TextFlow, FCN, SSD, Textboxes and Textboxes++).

TABLE 1
Dataset: ICDAR2013

Method | P | R | F | Time/s
MMser (Zamberletti, Noce, and Gallo 2014) | 0.86 | 0.70 | 0.77 | 0.75
TextFlow (Tian et al. 2015) | 0.85 | 0.76 | 0.80 | 1.4
FCN (Zhang et al. 2016) | 0.88 | 0.78 | 0.83 | 2.1
SSD (Liu et al. 2016) | 0.80 | 0.60 | 0.68 | 0.1
Textboxes | 0.86 | 0.74 | 0.80 | 0.09
Textboxes++ | 0.86 | 0.74 | 0.80 | 0.10
Ensemble | 0.83 | 0.77 | 0.80 | 0.15

Table 1 shows text localization results on ICDAR2013. “Time” refers to the execution time of the computer code implementing an approach. P, R and F refer to precision, recall and F-measure respectively. Precision = number of correctly predicted text boxes / total number of predicted text boxes. Recall = number of correctly predicted text boxes / total number of text boxes in the image. F-measure = 2 × (precision × recall) / (precision + recall).
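The three metric definitions above translate directly into code. The following is a minimal sketch; the function names and the zero-division handling are choices made for this example, and the counts in the usage note are made up for illustration rather than taken from the experiments.

```python
from typing import Tuple


def f_measure(precision: float, recall: float) -> float:
    """F-measure = 2 * (precision * recall) / (precision + recall)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def precision_recall_f(num_correct: int, num_predicted: int,
                       num_ground_truth: int) -> Tuple[float, float, float]:
    """Compute (precision, recall, F-measure) from text box counts.

    num_correct:      correctly predicted text boxes
    num_predicted:    total predicted text boxes
    num_ground_truth: total text boxes in the image
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_ground_truth if num_ground_truth else 0.0
    return precision, recall, f_measure(precision, recall)
```

With illustrative counts of 3 correct boxes out of 4 predicted and 6 ground-truth boxes, this gives a precision of 0.75, a recall of 0.5 and an F-measure of 0.6. Applying `f_measure` to the Ensemble row of Table 1 (P = 0.83, R = 0.77) gives approximately 0.80, matching the tabulated F value.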

The ensemble model provided herein was also tested on a dataset containing publicly available product images from the Walmart, Inc. website. These are high-resolution, high-entropy images of the front face of processed food items used on a daily basis by consumers. The predicted text region bounding boxes enclose regions containing texts of multiple sizes and mixed font types in the same image. This is particularly important for product images, as product labels often contain texts in multiple fonts. The proposed text detection strategy also successfully detects text regions when the text is moderately rotated or curved due to the shape of the product package, e.g., a can or a bottle (see the 2nd, 3rd and 4th images in the bottom row in FIG. 4). The use of wide kernels is useful in detecting horizontal text boxes and, on top of that, the image segmentation and CNN ensemble network consider image convolution filters at multiple scales and rotation angles. This contributes to ensuring that the text box detection accuracy is invariant at least under limited distortion and rotation of the horizontal orientation of the text.

The models trained on the ICDAR2013 training set were used on 50 images from this dataset for which the ground truth boxes are known. The main difference between the images in this dataset and the other publicly available datasets is that the images have none of the background noise that is usually present in scene text. However, multiple texts are usually present in a small region of the image along with various other objects, resulting in high local entropy. Most of the models currently available perform poorly at detecting text in such regions of an image. In such cases, the approaches provided herein perform better than the existing ones in terms of precision, recall and F-score. On the ICDAR2013 dataset, the model performed on par with the existing models currently available, but its relative performance improves drastically on the dataset containing high entropy images. The precision is at least 6% higher than that of the existing methods, while recall is higher by around 15%. Table 2 compares the results achieved on the Walmart high entropy images.

TABLE 2
Dataset: High Entropy Images

Method | P | R | F
Textboxes | 0.867 | 0.264 | 0.405
Textboxes++ | 0.831 | 0.311 | 0.453
Ensemble | 0.920 | 0.467 | 0.619

Table 2 shows text localization results on the high entropy image dataset. P, R and F refer to precision, recall and F-measure respectively. Precision = number of correctly predicted text boxes / total number of predicted text boxes. Recall = number of correctly predicted text boxes / total number of text boxes in the image. F-measure = 2 × (precision × recall) / (precision + recall).

The approaches provided herein provide algorithms which employ an ensemble of multiple fully convolutional networks preceded by an image segmenter for text detection. These approaches are highly stable and parallelizable and can detect words of varied sizes in an image with very high entropy. Comprehensive evaluations and comparisons on benchmark datasets clearly validate the advantages of these approaches in three related tasks: text detection, word spotting and end-to-end recognition. They even exhibit better performance than the Textboxes and Textboxes++ products/approaches in detecting graphical text in an image. The ICDAR2013 dataset images have real-world contents and background noise surrounding the true text regions, unlike the Walmart high entropy images, where the challenge is largely the presence of multiple textual elements within small regions, resulting in higher entropy. The approaches provided herein are particularly targeted to work on such high entropy text regions and hence perform very well on high entropy images. However, a more targeted background removal strategy, image segmentation and text candidate pre-filtering using text-region-specific key point identification and feature descriptions, such as stroke width descriptors and Maximally Stable Extremal Region descriptors, will enhance the performance of the CNN ensemble model even more.

In some embodiments, one or more of the exemplary embodiments include one or more localized IoT devices and controllers (e.g., included with or associated with the various scanners, sensors, cameras, or robots described herein). In another aspect, the user electronic devices or automated vehicles may be seen as an IoT device. As a result, in an exemplary embodiment, the localized IoT devices and controllers can perform most, if not all, of the computational load and associated monitoring, and then later asynchronous uploading of data can be performed by a designated one of the IoT devices to a remote server. In this manner, the computational effort of the overall system may be reduced significantly. For example, whenever localized monitoring allows remote transmission, secondary utilization of controllers keeps securing data for other IoT devices and permits periodic asynchronous uploading of the summary data to the remote server. In addition, in an exemplary embodiment, the periodic asynchronous uploading of data may include a key kernel index summary of the data as created under nominal conditions. In an exemplary embodiment, the kernel encodes relatively recently acquired intermittent data (“KRI”). As a result, in an exemplary embodiment, KRI includes a continuously utilized near term source of data, but KRI may be discarded depending upon the degree to which such KRI has any value based on local processing and evaluation of such KRI. In an exemplary embodiment, KRI may not even be utilized in any form if it is determined that KRI is transient and may be considered as signal noise. Furthermore, in an exemplary embodiment, the kernel rejects generic data (“KRG”) by filtering incoming raw data using a stochastic filter that provides a predictive model of one or more future states of the system and can thereby filter out data that is not consistent with the modelled future states which may, for example, reflect generic background data. In an exemplary embodiment, KRG incrementally sequences all future undefined cached kernels of data in order to filter out data that may reflect generic background data. In an exemplary embodiment, KRG incrementally sequences all future undefined cached kernels having encoded asynchronous data in order to filter out data that may reflect generic background data. In a further exemplary embodiment, the kernel will filter out noisy data (“KRN”). In an exemplary embodiment, KRN, like KRI, includes substantially a continuously utilized near term source of data, but KRN may be retained in order to provide a predictive model of noisy data. In an exemplary embodiment, KRN and KRI also incrementally sequence all future undefined cached kernels having encoded asynchronous data in order to filter out data that may reflect generic background data.

Those skilled in the art will recognize that a wide variety of modifications, alterations, and combinations can be made with respect to the above described embodiments without departing from the scope of the invention, and that such modifications, alterations, and combinations are to be viewed as being within the ambit of the inventive concept.

What is claimed is:
1. A system, comprising:
a data storage unit including a trained first mathematical model and a trained second mathematical model, wherein the first mathematical model is different and distinct from the second mathematical model;
an electronic communication network;
an electronic server coupled to the electronic communication network that hosts a web-based catalog ordering system that receives electronic orders from customers;
a control circuit that is coupled to the electronic communication network and the data storage unit, wherein the control circuit is configured to:
receive an image of a product from a vendor via the electronic communication network, the product proposed by the vendor to be sold to retail customers;
perform segmentation on the image to divide the image into individual regions of homogeneous pixels, wherein the segmentation is effective to create a segmented image;
apply the segmented image to the first mathematical model to produce a first group of text regions and apply the segmented image to the second mathematical model to obtain a second group of text regions, wherein each of the text regions includes potential text or symbols;
apply a selective non-maximal suppression (sNMS) algorithm to the first group of text regions and the second group of text regions to obtain a final group of text regions, the selective NMS algorithm being effective to remove overlapping regions at the same location or general location in the image, the selective NMS algorithm selecting text regions most likely to include text;
analyze informational content of the text regions and perform an action that utilizes the informational content of the text regions, the action being one or more of:
applying the informational content to the web-based ordering catalog, receiving a customer order from a customer as a result of the informational content, and physically fulfilling the received customer orders using an automated order fulfillment system to ship items in the order to the customer;
scanning the informational content for offensive content, and sending a message to a vendor via the electronic network to remove the offensive content or removing the item from a retail store or warehouse when an item including the offensive content exists in the retail store or warehouse.
2. The system of claim 1, wherein the item that is removed from the retail store or warehouse is removed using an automated vehicle to navigate to the item and remove the item from a display unit or storage unit.
3. The system of claim 2, wherein the automated vehicle is an automated ground vehicle or an aerial drone.
4. The system of claim 1, wherein the first group of text regions, the second group of text regions, and the final group of text regions comprise text boxes.
5. The system of claim 1, wherein the first mathematical model and the second mathematical model are convolutional neural networks (CNNs).
6. The system of claim 1, wherein the first mathematical model and the second mathematical model are trained using training images.
7. The system of claim 1, further comprising a camera, the camera coupled to the electronic communication network, the camera configured to obtain the image.
8. A method, the method comprising:
providing a data storage unit that includes a trained first mathematical model and a trained second mathematical model, wherein the first mathematical model is different and distinct from the second mathematical model;
providing an electronic communication network and an electronic server that is coupled to the electronic communication network, the server hosting a web-based catalog ordering system that receives electronic orders from customers;
providing a control circuit that is coupled to the electronic communication network and the data storage unit;
at the control circuit, receiving an image of a product from a vendor via the electronic communication network, the product proposed by the vendor to be sold to retail customers;
at the control circuit, performing segmentation on the image to divide the image into individual regions of homogeneous pixels, wherein the segmentation is effective to create a segmented image;
at the control circuit, applying the segmented image to the first mathematical model to produce a first group of text regions and applying the segmented image to the second mathematical model to obtain a second group of text regions, wherein each of the text regions includes potential text or symbols;
at the control circuit, applying a selective non-maximal suppression (sNMS) algorithm to the first group of text regions and the second group of text regions to obtain a final group of text regions, the selective NMS algorithm being effective to remove overlapping regions at the same location or general location in the image, the selective NMS algorithm selecting text regions most likely to include text;
at the control circuit, analyzing informational content of the text regions and performing an action that utilizes the informational content of the text regions;
wherein the action is one or more of:
applying the informational content to the web-based ordering catalog, receiving a customer order from a customer as a result of the informational content, and physically fulfilling the received customer orders using an automated order fulfillment system to ship items in the order to the customer;
scanning the informational content for offensive content, and sending a message to a vendor via the electronic network to remove the offensive content or removing the item from a retail store or warehouse when an item including the offensive content exists in the retail store or warehouse.
9. The method of claim 8, wherein the item that is removed from the retail store or warehouse is removed using an automated vehicle to navigate to the item and remove the item from a display unit or storage unit.
10. The method of claim 9, wherein the automated vehicle is an automated ground vehicle or an aerial drone.
11. The method of claim 8, wherein the first group of text regions, the second group of text regions, and the final group of text regions comprise text boxes.
12. The method of claim 8, wherein the first mathematical model and the second mathematical model are convolutional neural networks (CNNs).
13. The method of claim 8, wherein the first mathematical model and the second mathematical model are trained using training images.
14. The method of claim 8, further comprising a camera, the camera coupled to the electronic communication network, the camera configured to obtain the image.